Global coordination level in single-cell transcriptomic data

Genes are linked by underlying regulatory mechanisms and by jointly implementing biological functions, working in coordination to apply different tasks in the cells. Assessing the coordination level between genes from single-cell transcriptomic data, without a priori knowledge of the map of gene regulatory interactions, is a challenge. A ‘top-down’ approach has recently been developed to analyze single-cell transcriptomic data by evaluating the global coordination level between genes (called GCL). Here, we systematically analyze the performance of the GCL in typical scenarios of single-cell RNA sequencing (scRNA-seq) data. We show that an individual anomalous cell can have a disproportionate effect on the GCL calculated over a cohort of cells. In addition, we demonstrate how the GCL is affected by the presence of clusters, which are very common in scRNA-seq data. Finally, we analyze the effect of the sampling size of the Jackknife procedure on the GCL statistics. The manuscript is accompanied by a description of a custom-built Python package for calculating the GCL. These results provide practical guidelines for properly pre-processing and applying the GCL measure in transcriptional data.

www.nature.com/scientificreports/ dependencies between the genes. Before doing so however, there are several pre-and post-processing steps which must be taken into account. There are three major steps: (1) Clustering cells into homogeneous groups.
(2) Filtering cells that differ significantly from the other cells ('outliers'), or cells which are too similar to each other ('inliers'). (3) Performing jackknife analysis to ensure the stability of the results. These steps are required because, as we show, the GCL (like other correlation measures) is sensitive to unusual cells and heterogeneous cohorts, while scRNA-seq data is typically sparse, noisy and characterized by outliers and clusters. Based on a systematic analysis of the effect of the above steps, the recommended guidelines presented here are important to ensure stable and reliable results of the GCL measure.

Material and methods
Global coordination level (GCL). As mentioned in the introduction, the GCL is a 'top-down' computational method to evaluate the system-wide multivariate dependency between the genes, by calculating the distance correlation of random gene sets. A set of M measured cells with N genes is represented by the matrix X ∈ R N×M , where each vector column x (ν) (ν = 1 . . . M) is the measurement of an individual cell ν , and each element within the vector x ν i (i = 1 . . . N) is the measured expression value of gene i. The GCL is then performed as follows. The matrix X is divided into two random complementary parts G and G . The two parts are represented by the matrices X G ∈ R N/2×M and X G ∈ R N/2×M such that X G represents one randomly selected subset of the genes of size N/2, while X G represents the rest of the genes ( G ∩ G = ∅ and |G ∪ G| = N ). Then, we calculate the bias-corrected distance correlation (bcdCorr) 17 between groups G and G . The bcdCorr, a refined version of the distance correlation (dCorr) measure 18 , evaluates the level of dependence between two high-dimensional variables by testing how the distance between the samples, with respect to one (high-dimensional) variable, is changed compared to the distance between the same samples with respect to the other (high-dimensional) variable. Accordingly, when applied to gene-expression data, the bcdCorr is a measure of the dependency level between the gene-sets G and G . A short description of how to calculate the bcdCorr between two variables is presented below. The process of calculating the bcdCorr is repeated m times, for different random divisions of the genes. Finally, the GCL is defined as the average across these divisions, i.e., where all m random divisions of the data (X k G , X k G ) are independently chosen. For a large homogeneous cohort, the empirical GCL is bounded between zero and one: zero corresponds to the case of independent gene expression, whereas a significantly non-zero GCL reflects coordinated transcriptional expression, which could be interpreted as a result of underlying molecular dynamics, such as gene-to-gene regulatory interactions.
Calculating the bcdCorr. Here we briefly show the full procedure of calculating the empirical bcdCorr between two variables as defined by Székely and Rizzo 17 . Consider M observations of two high-dimensional variables X i ∈ R p and Y i ∈ R q , i = 1, . . . , M , where X i = (X i,1 , . . . , X i,p ) and Y i = (Y i,1 , . . . , Y i,q ) . Note that q does not necessarily have to be equal to p (i.e., the two variables may have unequal dimensions). The M observations are represented by the p × M matrix X and q × M matrix Y. The empirical bcdCorr(X, Y ) is defined as where and A * i,J and B * i,j are matrices defined as Clusters detection. Methods of identifying and removing clusters from the data are numerous and well studied. In this manuscript, we advise to apply detection and removal of clusters in the following manner. We apply the k-means algorithm, with varying the total number of clusters, 1 < K < 10 . The distance between the samples is measured using the Spearman dissimilarity. We follow that by a Silhouette coefficient analysis that determines the optimal number of clusters. For each of the C k clusters ( k = 1 . . . K ), we calculate the mean distance, a i , between each cell x i and all the other cells which belong to the same cluster, where |C k | is the number of samples in cluster k, and Spearman(x i , x j ) , is the distance between cell i and j based on the Spearman dissimilarity. Additionally, we calculate the smallest mean distance between cell i to all other cells in the other clusters, of which i is not a member, b i , The silhouette score of each cell, s K i ( i = 1 . . . M ) is then defined as where the superscript K indicates that the value is calculated for this particular choice of total number of clusters. nThe value of the silhouette score is bounded −1 < s K i < 1 . Values which are close to 1 means that the data is clustered correctly with respect to cell i. We average the silhouette value across all the cells to get a unified value Finally, the optimal value of K is chosen by maximizing the value of S K with 10 realizations of the k-means algorithm. In general, as each clustering detection method has its own advantages and disadvantages, we recommend supplementing the detection step with a visual inspection using dimensional reduction analysis, such as t-SNE.
Effect of an individual cell on the GCL. In "Removal of abnormal cells" section of this manuscript, we measure the effect of a single outlier or inlier on the GCL value. The relative effect GCL i of cell i on the cohort of cells is defined as where ' GCL with cell i ' is the GCL of the entire cohort and ' GCL without cell i ' is the GCL of the cohort but with cell i excluded. In general, outliers and inliers tend to increase the GCL value of the cohort (see e.g. Figure 1), making the GCL i positive.
Transcription data sets used in this work. In the demonstration presented here we have used single-cell RNA-seq datasets from the following sources (Table 1).
Numerical model for synthetic gene expression data. In the Results we investigate the effect of the presence of clusters in the data on the GCL value with synthetic gene expression data obtained from simulated numerical models of gene regulatory dynamics. The expression profile x (ν) of cell ν is modelled as the steady state of a set of coupled ordinary differential equations (ODEs), representing the gene regulatory dynamics [23][24][25] . Specifically, we use the following set of ODEs, where we set the number of genes, N = 200 . The first term expresses a self degradation of gene i. The second term is responsible for the growth of x   www.nature.com/scientificreports/ In our simulation we use GRNs with random links between the nodes, i.e., each pair of genes is connected with a constant probability in the form of an Erdős-Rényi network with an average degree of three. Finally, the GRN weights w i,j (for existing links) are randomly selected from the uniform distribution U(0, 2) . To create different cells from the same GRN, a random subset of genes in each cell are set as inoperative, i.e., their expression levels are set to zero. Specifically, in our simulations we randomly choose 5 out of the N = 200 genes to be inoperative. The expression profile of each cell x (ν) is generated by solving the GRN differential equations with the same initial conditions ( x i (t = 0) = 0.5 ) and evaluating the steady state using the ode45 MATLAB function.  www.nature.com/scientificreports/ Using this procedure, we create two distinct cohorts of cells, which differ by their GRNs. Each cohort A and B has its own weight matrix, w A i,j and w B i,j . The cohorts has the same number of cells ( M = 50 ) and genes ( N = 200 ). The GRN of the two cohorts have a fraction 1 − p of shared links, i.e., the GRN of cohort A has a p fraction of links that are not present in B and vice-versa. If p = 0 the GRN of the cohorts are identical. As p is increased, the number of shared links is decreased, and when p = 1 the GRNs of the two cohorts have no shared links.
After generating the cells in each cohorts, the sample-to-sample mean distances, D, between the cohorts is measured. It is defined as the mean distance between the cohorts divided by the sum of mean of distances inside each cohort, i.e., where, When the cohorts are generated with the same GRN (i.e., p = 0 ), D = 0.5 . As p is increased, D gets larger. The dependency of the GCL value on these parameters is presented in the Results section.

Results
Here we describe the necessary pre-processing steps and the application of the GCL measure. Specifically, this includes: (1) Removal of abnormal cells, (2) Clustering analysis, and (3) Jackknife re-sampling.
Removal of abnormal cells. In our previous work 15 , during the analysis of the single-cell transcriptomic data, we have noticed the significant sensitivity of the GCL towards single abnormal cells. The sensitivity is not only towards outliers, cells which are far away from the rest of the cells, but also for cells which are unusually similar to other cells, which we refer to here as 'inliers' . The abnormality of the cells with respect to the majority of the cohort can be observed by examining the distribution of 'distances' (Spearman dissimilarity) between all the cells (Fig. 1a). While the cell-to-cell distances among most of the cells are distributed near a characteristic value, there are a small number of cells, associated with exceptionally large or small values, compared with the mean (red and blue shaded area respectively). Figure 1b shows the effects of outliers and inliers of a typical cohort of cells. The data is a cohort of short-term hematopoietic stem cells (ST-HSCs) from young C57 mice 19 . We measure the effect of each individual cell on the GCL, GCL i , defined as the relative difference between the GCL with and without cell i (see "Material and methods" section) For each cell we calculate the minimal distance between it and any of the other cells in the cohort. The minimal distance of an outlier would be relatively large, whilst the minimal distance of an inlier would be relatively low. When plotting GCL versus the minimal distance, a characteristic parabola-like shape is apparent (Fig. 1b), where both outliers and inliers have a significant relative effect on the value of the GCL (where outliers seem to be more significant then inliers). To ensure that the GCL is not biased due to the effect of a few outliers or inliers cells, we recommend filtering them out prior to the analysis. This can be done, for example, by looking into the distribution of distances between all the cells, and removing cells which have some distance which is two standard deviations above/below the mean distance between the cells (Fig. 1a). Note the inherent difference between outliers and inliers with respect to the cohort. When an inlier is present in the data, it means that there are two cells which are abnormally similar to each other. In this case, it is sufficient to remove only one of the cells in order to increase the homogeneity of the cohort. When an outlier is present, there is no arbitrary choice in the filtering process.
To examine this effect thoroughly, we introduce a simulated cell into a cohort of real cells and evaluate its GCL , as follows. First, we chose two random cells from the cohort of real cells (blue and black circles in Fig. 1c). Then, we replace one of these cells with a simulated cell, constructed as a linear combination between the two cells, with a tuning parameter, 0 ≤ ≤ 1 . When = 0 the simulated cell is identical to the replaced cell. When 0 < < 1 , the simulated cell becomes more similar to the other cell. For each value of , we measure both the minimal distance and the GCL associated with the simulated cell. When the minimal distance decreases, the effect on the GCL increases (blue lines in Fig. 1d). We repeat this process for 10 random cells initiations.
We then perform a similar process on an outlier cell, by replacing one of the outlier cells (red circle in Fig. 1c) by a simulated cell, constructed as a linear combination between it and a random cell (the black circle in Fig. 1c). This time, in order to focus on the effect of the outlier with respect to the cohort and not on the similarity between two cells (the 'inliers' effect), we also remove the random cell from the analysis. This manipulation results in a simulated cell that, by tuning the value, move on the trajectory between normal and outlier cell, i.e., when = 0 the simulated cell is identical to the randomly selected cell, and when = 1 the simulated cell is identical to the outlier cell. Figure 1d) shows that when the minimal distance increases, the effect of the simulated cell on the GCL significantly increases. Both the inliers and the outliers' effects, which are shown in Fig. 1d) using simulated cells, demonstrate a similar pattern to the parabola shape observed for the real data.  (Fig. 1e). The figure shows the GCL values of cohorts of cells before and after the removal of outliers/inliers. While it is clear that their presence tends to increase the GCL, the most pronounced effect is visible in the ST-HSC cohort from Ref. 19 (Fig. 1a). There, the presence of a few outlier/inliers (5) can cause the GCL of the jackknife group to dramatically increase to the maximum value (1). In fact, a bimodel distribution can be observed when the outliers/inliers are present, which represent jackknife groups with and without some of these unusual cells.
The effect of clusters. Like other statistical inference tools, the GCL is also sensitive to heterogeneity in the data. A common source of heterogeneity is the presence of clusters in the cohort of cells. In this section, we detail how does the presence of clusters affects the GCL value, as well as provide recommended guidelines on how to filter out clusters. In general, clusters in the cohort tend to increase the GCL values, sometimes even to extreme values close to 1. This effect is analogous to the biases of Pearson correlation values calculated over heterogeneous data. The goal of the GCL is to provide a sense of the strength of the inter-dependencies between the genes. To this aim, we wish to calculate the GCL for isolated homogeneous cohorts, in order to avoid a type of Simpson's paradox, where the presence of two or more clusters in the data affects the apparent relations between the genes (see "Discussion"). Figure 2a1-a2 shows this effect in a heterogeneous cohort, artificially induced by combining multipotent progenitors (MPP) cells from two different mice (ages 3 months ('young') and 22 months ('old')) (from Ref. 20 , see "Material and methods"). Calculating the GCL over the entire cohort yields an extreme value which is very close to 1, whereas the GCL values of each separate group (young and old) have much more sensible values, lower than 0.5 (Fig. 2a2). To verify that this change of the GCL values is not due to the change of sample size (number of cells), we calculate it also over a fourth group, which is comprised of a random assortment of the  Figure 2a2 shows that this group also has an artificially large GCL value. Figure 2b-d also show the effect of heterogeneity, but for naturally occurring clusters (from Ref. 19,20 ). The clusters are marked with blue and green colors. The effect of the clusters on the GCL is evident (although less dramatic than the artificial case in a). A mixed cohort will tend to have larger GCL values than any of the separate clusters. We also verify this effect using numerical models of synthetic gene expression data (see "Material and methods" ). Two cohorts, A and B, of increasingly differentiated gene regulatory network are generated, and the GCL of them is measured (Fig. 3). The cohorts have the same number of cells ( M = 50 ) and genes ( N = 200 ). When the GRN of the two cohorts is identical, the GCL value of each cohort approximately equals to the GCL of the joint cohorts, which is comprised of all the cells from both A and B. As the differences between the GRN of the cohorts increase, the GCL of each individual cohort remains stable. This shows that the GCL, which measures the general dependencies between the genes is not sensitive to small-scale differences. However, the GCL of the combined cohort increases. This is due to the heterogeneity in the data, which is a confounding factor that artificially increases the GCL, similar to what is observed in real data.
We have shown here how the presence of clusters creates artificial inflation of the GCL value. In "Material and methods" , we detail our proposed recommended method of how to detect and filter clusters in the data. Jackknife re-sampling procedure. Generally, the stability of the GCL calculation, i.e., how sensitive is it to the exact cell composition of the cohort, is related to two factors: the homogeneity of the cell population, and the sample size (number of cells in the analyzed cohort). To evaluate the stability of the GCL for a given cohort of cells, we apply the jackknife re-sampling procedure 26 . Specifically, we randomly re-sample s sub-sets of k cells, with repetition, and calculate the GCL of each of them. If the cell population is substantially homogeneous, the variance of the GCL values calculated over the different sub-sets will mainly reflect the finite size effect. In contrast, if the cell population includes a small number of outlier cells (see "Removal of abnormal cells"), they will be sampled and affect the GCL values in only some of the sub-sets, increasing the variability of the GCL values. Therefore, the jackknife procedure is well-suited for the stability estimation of the GCL.
However, re-sampling sub-sets of cells, with repetitions, from a finite cohort introduces the issue of the choice of the overlap level between the different sub-set. Thus, the predetermined value of the k parameter evidently affects the variance of the GCL values. For large values of k, the overlap between the different sub-sets is high, and subsequently the variance of the GCL values will be relatively small. On the other hand, for small values of k (small number of cells in each sub-set), the GCL results are highly vulnerable to finite-size noise and hence the variance of the GCL values will be relatively large.
To test this effect, we systematically calculate the GCL values with the jackknife procedure using different values of k. The data presented in Fig. 4 contains the gene expression profiles of N = 2000 genes (with the largest mean expression) from M = 113 LT-HSCs from a young C57 mouse 19 . While the average GCL is reasonably constant of k, the distributions are increasingly narrower for larger values of k (specifically, when k = 113 , all the subsets are identical and their GCL values converge to a single value). When plotting the standard deviation of the GCL distributions, σ (GCL) , versus the number of cells, k, on log-log plot, two regions can be observed (Fig. 4b). For k < k * , where k * ≈ 85 , σ (GCL) ∼ k −1 (linear relation on the log-log plot at the red colored zone). For k > k * , σ (GCL) rapidly decreases as k approaches the cohort size (blue colored zone). The two regions correspond to the two effects mentioned above. For small values of k, the decrease of σ (GCL) is mainly due to the increased sample size of each jackknife realization. For large values of k, the large overlap between the re-sampled subsets becomes the dominant effect.
These results suggest that k = k * may be considered as an optimal choice, as it decreases the 'finite size effect' with a minimal 'overlap effect' . Next, we test how the value of k * depends on the cohort size. We repeat the same analysis as in Fig. 4b for smaller cohort sizes (selected from the same data-set). Figure 4c shows that the cross-over value, k * , between the linear decay (on the log-log scale) and the rapid decrease of σ (GCL) can be approximated as ≈ 75% of the cohort size.
We repeat the analysis for shuffled data, with a much larger sample size ( M = 1000 ). The shuffling procedure removes the relationship between the genes, causing the average GCL to be equal to 0 (Fig. 4d). By having a cohort size which is much larger than k, the overlap effect is negligible and the sample-size effect is dominant. As shown in Fig. 4e, the sample-size effect yields k −1 dependency of the standard deviation of the GCL, similar to the observation in Fig. 4b. When analyzing cohorts of shuffled data with different sizes, we find that the k * value is about ≈ 75% of the cohort size (Fig. 4f).
To conclude, when analyzing the stability of the GCL results in other cohorts, we recommend to repeat the analysis presented here to find the optimal value of k for the jackknife procedure. Otherwise, we would recommend to choose k ≈ 75% × M as a default value. GCL script description. We publish on GitHub 27 an open-source code package written in Python 3.9 which includes a tutorial and working example of the GCL calculation as well as a GCL library. The tutorial demon- The library can be imported into an existing Python project or executed as a stand-alone program to calculate the GCL for given data (vector or GCL values for the jackknife realizations). Different features of the GCL calculations can be manually chosen. For example, Jackknife_percentage determines the fraction of cells in each jackknife realization from the entire cohort, k/M; and Jackknifes determines the number of jackknife realizations, s. Figure 5 shows a typical output of the Python script for a cohort of LT-HSC cells. The GCL distributions become narrower for larger values of k, where k/M is set to be 0.5, 0.7 and 0.9 in Fig. 5a-c), respectively. A second script, which only calculates the GCL using MATLAB is also freely available 28 .

Discussion
In this manuscript, we present general guidelines and recommended practices on how to appropriately utilize the GCL method on scRNA-seq datasets. A main consideration when applying any bulk analysis method to a cohort of cells, such as co-expression analysis or the GCL, is to ensure that the analyzed cohort is as homogeneous as possible. This does not mean that the transcription profiles of all the cells must be identical, or very similar to each other, but rather that the cell-to-cell variability should not be dominated by a small number of outlier cells or by the presence of distinct clusters. The presence of outliers or clusters may lead to spurious results when applying measures of gene-gene interrelations, such as co-expression analysis or GCL. Therefore, we consider the removal of outliers and clusters as a standard pre-processing procedure and a necessary step in analyses that focus on the interrelations between genes.
'Inliers' , i.e., cells which have unusually similar transcription profiles, is another condition that should be taken into account when calculating the GCL. Two very similar cells could be present due to biological effects, for example, if the sampling corresponded with a recent division of the cell. We show that the GCL may be disproportionately affected by such cases in the cohort, with a tendency to artificially increase its value. We therefore recommend to view such cells as another form of heterogeneity, and to remove one of them before the calculation of the GCL.
We recommend to accompany the calculation of the GCL with a jackknife procedure, where the GCL is calculated for a number of re-sampled sub-sets of cells, yielding a distribution of values. There are two benefits to this step. First, it can serve as an additional layer for testing the homogeneity of the analyzed cohort. Specifically, if the bulk calculation of the GCL is disproportionately affected by individual cells, the subsets that include it will have elevated GCL values that may be observed by examining the distribution (see e.g. Fig. 1e). Such cells may be missed in the initial pre-processing steps described above. Note, however, that the jackknife procedure is not suited for detecting the presence of clusters, since this kind of heterogeneity is preserved in the re-sampled subsets, as demonstrated in Fig. 2. Second, when comparing GCL values from two different cohorts, the jackknife distributions are essential. For this purpose, a rational choice of the re-sampling size (the value of k) is required. We propose a methodology to balance between two extremes: a too small value of k introduces statistical noise due to small sample-size, whereas a value of k that is too close to the cohort size will not generate the representative variability of the GCL, due to the high overlap between the re-sampled sub-sets. Additionally, the jackknife www.nature.com/scientificreports/ re-sampling is an effective way to eliminate the potential bias due to a different number of cells when comparing two or more cohorts of different sizes, by calculating the GCL over the same size of re-sampled subsets. The recommendation to apply the GCL on homogeneous cohorts may seem counter-productive to the usefulness of scRNA-seq to reveal the heterogeneity of the data. Indeed, the heterogeneity can discover hidden information, such as identifying new sub-types of cells. Importantly, we distinguish between two types of heterogeneity: (1) heterogeneous cohorts that contain different cell types/sub-types, or individual outliers/inliers cells; (2) the natural cell-to-cell variability of gene expression profiles among cells of the same tissue, cell type and biological conditions. The goal of the GCL is to unveil underlying order, in terms of gene-gene coordination, from the latter type, i.e., natural cell-to-cell variability. For the GCL analysis, like other correlation analysis methods, heterogeneity of the first type in the analyzed data may be detrimental. It has the proclivity of generating spurious relations, e.g., strong observed correlations due to confounding factors (Simpson's paradox). Therefore, the pre-processing steps should minimize the first type of heterogeneity while preserving the natural cell-to-cell variability.
The added value of the GCL with respect to the traditional bottom-up approaches can be illustrated by the following example. Consider an experiment to test the effect of a given condition by comparing the scRNA-seq data of control and case cohorts. We can divide the apparent differences between these two cohorts into two kinds. Gene-specific differences are 'small-scale' . They can be detected as deferentially expressed genes (DEGs) between the case and control, or as 'differential gene-gene interactions' , which may indicate an alteration of specific elements in the underlying regulatory mechanism. The second type is 'large-scale' differences, i.e., not gene-specific. For example, during aging, cells accumulate stochastic genetic and epigenetic damage that affect each individual cell differently. The group-effect of such random damage is not observed in the same genes across different cells, but rather may be observed as a general decline of the gene-to-gene dependency level. The GCL is a measure of this kind.
In practice, when comparing two cohorts we recommend first to detect any small-scale differences between them using the traditional approaches, e.g. filtering DEGs between them. Then, the GCL can be applied on the remaining genes, i.e., the non-differentially expressed genes. This is beneficial in two ways. On the one hand, it rules out the potential bias due to the effect of the DEGs on the GCL. On the other hand, if there are observed significant differences between the GCL values calculated over the non-differentially expressed genes, then they may be interpreted as a result of large-scale alterations.
As a final remark, the proposed filtering process presented here uses auxiliary algorithms (Silhouette, Spearman-based outlier removals, etc) which might not be best suited in all cases. There is a plethora of cluster identification and outlier removal algorithms in the scientific literature, and it is certainly possible that the use of one of them would be more advantageous than the ones presented here for other sc-RNA datasets. For example, UMAP can replace t-SNE in the visual inspection step, DBSCAN can be used for clusters identification, and the Spearman matrix can be replaced by other dissimilarity measures, such as Euclidean distance, normalized mutual information, adjusted Rand index, Fowlkes-Mallows index or the Jaccard index. The goal of this manuscript is to demonstrate the sensitivity of the GCL to the structure of the data and propose possible solutions.
The GCL is therefore a complementary measure to the existing classical tool-set of gene-expression analysis. We hope that the GCL method, along with the recommended guidelines in this manuscript, be useful in the future analysis of scRNA sequencing.